NSF PAR Search | NSF Public Access Repository


An Interpolative Slicing Algorithm for Continuously Graded Stiffness in Viscous Thread Printed Foams

Nakura, Masa; Sarkar, Vivek; Revier, Daniel; Emery, Brett; Lipton, Jeffrey I (August 2024, The Minerals, Metals and Materials Society)

Foams, essential for applications from car seats to thermal insulation, are limited by traditional manufacturing techniques that struggle to produce graded stiffness, a key feature for enhanced functionality. Here, we introduce a novel slicing algorithm for producing heterogeneous foams through viscous thread printing (VTP). Our slicer generates a single, global toolpath for the entire foam volume while modulating the viscous thread’s self-interactions along this path to program stiffness. The slicer integrates multiple meshes into a unified print space and interpolates the print speed and height based on specified mesh parameters to program the desired stiffness variations. Using both qualitative samples and quantitative compression tests, we demonstrate that our slicer can (1) generate foam stiffnesses spanning an order of magnitude, (2) achieve millimeter precision in stiffness control, and (3) continuously vary stiffness between regions of constant stiffness using arbitrary functional forms.
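The interpolation step described in this abstract can be pictured with a minimal sketch. It assumes two regions of constant print parameters separated along one axis and blends print speed and height between them. The function name, the parameter layout, the units, and the default linear blend are all hypothetical illustrations and are not taken from the paper's slicer.

```python
def interpolate_print_params(x, x0, x1, params0, params1, shape=lambda t: t):
    """Blend (speed, height) between two constant-stiffness regions.

    x        -- position along the transition axis (hypothetical, e.g. mm)
    x0, x1   -- start and end of the transition zone (x1 must differ from x0)
    params0  -- (speed, height) used at and before x0
    params1  -- (speed, height) used at and after x1
    shape    -- maps the normalized position in [0, 1] to a blend weight;
                the identity gives a linear ramp (an assumption, since the
                abstract only says arbitrary functional forms are supported)
    """
    if x1 == x0:
        raise ValueError("transition zone must have nonzero width")
    # Normalize the position and clamp it so points outside the zone
    # take the parameters of the nearest region.
    t = (x - x0) / (x1 - x0)
    t = min(1.0, max(0.0, t))
    w = shape(t)
    return tuple(a + w * (b - a) for a, b in zip(params0, params1))


# Halfway through a 10 mm transition between two hypothetical settings.
mid = interpolate_print_params(5.0, 0.0, 10.0, (20.0, 4.0), (40.0, 8.0))
```

Passing a different `shape` callable (for example a smoothstep) changes the functional form of the transition without touching the rest of the code.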
Full Text Available
Leibniz International Proceedings in Informatics (LIPIcs): 37th European Conference on Object-Oriented Programming (ECOOP 2023)

https://doi.org/10.4230/LIPIcs.ECOOP.2023.13

Jin, Feiyang; Yu, Lechen; Cogumbreiro, Tiago; Shirako, Jun; Sarkar, Vivek (January 2023, European Conference on Object-Oriented Programming)
Ali, Karim; Salvaneschi, Guido (Ed.)
Much of the past work on dynamic data-race and determinacy-race detection algorithms for task parallelism has focused on structured parallelism with fork-join constructs and, more recently, with future constructs. This paper addresses the problem of dynamic detection of data-races and determinacy-races in task-parallel programs with promises, which are more general than fork-join constructs and futures. The motivation for our work is twofold. First, promises have now become a mainstream synchronization construct, with their inclusion in multiple languages, including C++, JavaScript, and Java. Second, past work on dynamic data-race and determinacy-race detection for task-parallel programs does not apply to programs with promises, thereby identifying a vital need for this work. This paper makes multiple contributions. First, we introduce a featherweight programming language that captures the semantics of task-parallel programs with promises and provides a basis for formally defining determinacy using our semantics. This definition subsumes functional determinacy (same output for same input) and structural determinacy (same computation graph for same input). The main theoretical result shows that the absence of data races is sufficient to guarantee determinacy with both properties. We are unaware of any prior work that established this result for task-parallel programs with promises. Next, we introduce a new Dynamic Race Detector for Promises that we call DRDP. DRDP is the first known race detection algorithm that executes a task-parallel program sequentially without requiring the serial-projection property; this is a critical requirement since programs with promises do not satisfy the serial-projection property in general. Finally, the paper includes experimental results obtained from an implementation of DRDP. The results show that, with some important optimizations introduced in our work, the space and time overheads of DRDP are comparable to those of more restrictive race detection algorithms from past work. To the best of our knowledge, DRDP is the first determinacy race detector for task-parallel programs with promises.
ReACT: Redundancy-Aware Code Generation for Tensor Expressions

https://doi.org/10.1145/3559009.3569685

Zhou, Tong; Tian, Ruiqin; Ashraf, Rizwan A.; Gioiosa, Roberto; Kestor, Gokcen; Sarkar, Vivek (October 2022, ACM)
SHMEM-ML: Leveraging OpenSHMEM and Apache Arrow for Scalable, Composable Machine Learning

https://doi.org/10.1007/978-3-031-04888-3_7

Grossman, Max; Poole, Steve; Pritchard, Howard; Sarkar, Vivek (May 2022, Lecture notes in computer science)
Poole, Steve; Hernandez, Oscar; Baker, Matthew; Curtis, Tony (Ed.)
SHMEM-ML is a domain-specific library for distributed array computations and machine learning model training & inference. Like other projects at the intersection of machine learning and HPC (e.g. dask, Arkouda, Legate Numpy), SHMEM-ML aims to leverage the performance of the HPC software stack to accelerate machine learning workflows. However, it differs in a number of ways. First, SHMEM-ML targets the full machine learning workflow, not just model training. It supports a general-purpose ndarray abstraction commonly used in Python machine learning applications, and efficiently distributes transformation and manipulation of this ndarray across the full system. Second, SHMEM-ML uses OpenSHMEM as its underlying communication layer, enabling high-performance networking across hundreds or thousands of distributed processes. While most past work in high-performance machine learning has leveraged HPC message passing communication models as a way to efficiently exchange model gradient updates, SHMEM-ML’s focus on the full machine learning lifecycle means that a more flexible and adaptable communication model is needed to support both fine- and coarse-grained communication. Third, SHMEM-ML works to interoperate with the broader Python machine learning software ecosystem. While some frameworks aim to rebuild that ecosystem from scratch on top of the HPC software stack, SHMEM-ML is built on top of Apache Arrow, an in-memory standard for data formatting and data exchange between libraries. This enables SHMEM-ML to share data with other libraries without creating copies of data. This paper describes the design, implementation, and evaluation of SHMEM-ML – demonstrating a general-purpose system for data transformation and manipulation while achieving up to a 38× speedup in distributed training performance relative to the industry-standard Horovod framework without a regression in model metrics.
Full Text Available
Memory access scheduling to reduce thread migrations

https://doi.org/10.1145/3497776.3517768

Damani, Sana; Barua, Prithayan; Sarkar, Vivek (March 2022, ACM)
Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators

https://doi.org/10.1145/3485137

Chatarasi, Prasanth; Kwon, Hyoukjun; Parashar, Angshuman; Pellauer, Michael; Krishna, Tushar; Sarkar, Vivek (March 2022, ACM Transactions on Architecture and Code Optimization)

A spatial accelerator’s efficiency depends heavily on both its mapper and cost models to generate optimized mappings for various operators of DNN models. However, existing cost models lack a formal boundary over their input programs (operators) for accurate and tractable cost analysis of the mappings, and this results in adaptability challenges to the cost models for new operators. We consider the recently introduced Maestro Data-Centric (MDC) notation and its analytical cost model to address this challenge because any mapping expressed in the notation is precisely analyzable using the MDC’s cost model. In this article, we characterize the set of input operators and their mappings expressed in the MDC notation by introducing a set of conformability rules. The outcome of these rules is that any loop nest that is perfectly nested with affine tensor subscripts and without conditionals is conformable to the MDC notation. A majority of the primitive operators in deep learning are such loop nests. In addition, our rules enable us to automatically translate a mapping expressed in the loop nest form to MDC notation and use the MDC’s cost model to guide upstream mappers. Our conformability rules over the input operators result in a structured mapping space of the operators, which enables us to introduce a mapper based on our decoupled off-chip/on-chip approach to accelerate mapping space exploration. Our mapper decomposes the original higher-dimensional mapping space of operators into two lower-dimensional off-chip and on-chip subspaces and then optimizes the off-chip subspace followed by the on-chip subspace. We implemented our overall approach in a tool called Marvel, and a benefit of our approach is that it applies to any operator conformable with the MDC notation. We evaluated Marvel over major DNN operators and compared it with past optimizers.
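As a concrete picture of the loop-nest shape this abstract describes, the sketch below writes a 1D convolution as a perfectly nested loop with affine subscripts and no conditionals. The example is an illustration chosen here, not one taken from the article, and it does not show the MDC notation or the Marvel tool itself; the function name `conv1d` is hypothetical.

```python
def conv1d(inp, weights):
    """1D convolution (no padding, stride 1) as a two-deep loop nest."""
    n_out = len(inp) - len(weights) + 1
    out = [0.0] * n_out
    # Perfectly nested: the only statement sits in the innermost loop body.
    for i in range(n_out):
        for r in range(len(weights)):
            # Every subscript is an affine function of the loop indices
            # (i, r, and i + r), and no conditional guards the update.
            out[i] += inp[i + r] * weights[r]
    return out


result = conv1d([1, 2, 3, 4], [1, 1])
```

A loop body that tested a condition before updating `out[i]`, or that indexed with something like `inp[i * r]`, would fall outside the three properties named in the abstract.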
Full Text Available
Task-graph scheduling extensions for efficient synchronization and communication

https://doi.org/10.1145/3447818.3461616

Bak, Seonmyeong; Hernandez, Oscar; Gates, Mark; Luszczek, Piotr; Sarkar, Vivek (June 2021, 35th ACM International Conference on Supercomputing (ICS))

Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in many programming models including OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization within inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Compared to past work, our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations related to matrix size, number of nodes, and use of CPUs vs GPUs.
Full Text Available
Foams with 3D Spatially Programmed Mechanics Enabled by Autonomous Active Learning on Viscous Thread Printing

https://doi.org/10.1002/advs.202408062

Emery, Brett; Snapp, Kelsey L; Revier, Daniel; Sarkar, Vivek; Nakura, Masa; Brown, Keith A; Lipton, Jeffrey Ian (September 2024, Advanced Science)

Foams are versatile by nature and ubiquitous in a wide range of applications, including padding, insulation, and acoustic dampening. Previous work established that foams 3D printed via Viscous Thread Printing (VTP) can in principle combine the flexibility of 3D printing with the mechanical properties of conventional foams. However, the generality of prior work is limited due to the lack of predictable process‐property relationships. In this work, a self‐driving lab is utilized that combines automated experimentation with machine learning to identify a processing subspace in which dimensionally consistent materials are produced using VTP with spatially programmable mechanical properties. In carrying out this process, an underlying self‐stabilizing characteristic of VTP layer thickness is discovered as an important feature for its extension to new materials and systems. Several complex exemplars are constructed to illustrate the newly enabled capabilities of foams produced via VTP, including 1D gradient rectangular slabs, 2D localized stiffness zones on an insole orthotic and living hinges, and programmed 3D deformation via a cable‐driven humanoid hand. Predictive mapping models are developed and validated for both thermoplastic polyurethane (TPU) and polylactic acid (PLA) filaments, suggesting the ability to train a model for any material suitable for material extrusion (ME) 3D printing.
OmpMemOpt: Optimized Memory Movement for Heterogeneous Computing

https://doi.org/10.1007/978-3-030-57675-2_13

Barua, Prithayan; Zhao, Jisheng; Sarkar, Vivek (August 2020, European Conference on Parallel Processing (Euro-Par 2020))
The fast development of acceleration architectures and applications has made heterogeneous computing the norm for high-performance computing. The cost of high volume data movement to the accelerators is an important bottleneck both in terms of application performance and developer productivity. Memory management is still a manual task performed tediously by expert programmers. In this paper, we develop a compiler analysis to automate memory management for heterogeneous computing. We propose an optimization framework that casts the problem of detection and removal of redundant data movements into a partial redundancy elimination (PRE) problem and applies the lazy code motion technique to optimize these data movements. We chose OpenMP as the underlying parallel programming model and implemented our optimization framework in the LLVM toolchain. We evaluated it with ten benchmarks and obtained a geometric speedup of 2.3×, and reduced on average 50% of the total bytes transferred between the host and GPU.
Full Text Available
Vyasa: A High-Performance Vectorizing Compiler for Tensor Convolutions on the Xilinx AI Engine

Chatarasi, Prasanth; Neuendorffer, Stephen; Bayliss, Samuel; Vissers, Kees; Sarkar, Vivek (September 2020, 2020 IEEE High Performance Extreme Computing Virtual Conference)
Xilinx’s AI Engine is a recent industry example of energy-efficient vector processing that includes novel support for 2D SIMD datapaths and a shuffle interconnection network. The current approach to programming the AI Engine relies on a C/C++ API for vector intrinsics. While an advance over assembly-level programming, it requires the programmer to specify a number of low-level operations based on detailed knowledge of the hardware. To address these challenges, we introduce Vyasa, a new programming system that extends the Halide DSL compiler to automatically generate code for the AI Engine. We evaluated Vyasa on 36 CONV2D workloads, and achieved geometric means of 7.6 and 24.2 MACs/cycle for 32-bit and 16-bit operands (which represent 95.9% and 75.6% of the peak performance respectively).
Full Text Available
